Sleeper agent




Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch

Neural Information Processing Systems

As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings.
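The gradient-matching idea at the core of this attack can be illustrated with a toy sketch. This is not the paper's implementation: it substitutes a linear surrogate model for a deep network and finite-difference descent for autograd, and every dimension, budget, and learning rate below is an illustrative assumption. The attacker perturbs clean training points, within a small l-infinity budget, so that their training gradient aligns with the gradient that would push a triggered target input toward the adversarial label.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension (illustrative)

# Surrogate linear model with squared loss: L(w; x, y) = 0.5 * (w @ x - y)**2
w = rng.normal(size=d)

def grad_w(x, y):
    """Gradient of the squared loss with respect to the weights w."""
    return (w @ x - y) * x

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Adversarial objective: the target input should be scored as y_adv.
x_target = rng.normal(size=d)
y_adv = 1.0
g_target = grad_w(x_target, y_adv)

# Clean, correctly labeled training points the attacker may perturb
# within an l-infinity budget eps (so no visible trigger is inserted).
xs = rng.normal(size=(4, d))
ys = np.zeros(4)
eps = 0.5
delta = np.zeros_like(xs)

def matching_loss(delta):
    """1 - cosine similarity between poison and target gradients."""
    g_poison = np.mean(
        [grad_w(x + dx, y) for x, dx, y in zip(xs, delta, ys)], axis=0
    )
    return 1.0 - cosine(g_poison, g_target)

loss_before = matching_loss(delta)
best_loss, best_delta = loss_before, delta.copy()
lr, h = 0.05, 1e-5
for _ in range(150):  # finite-difference projected gradient descent
    g = np.zeros_like(delta)
    for idx in np.ndindex(delta.shape):
        e = np.zeros_like(delta)
        e[idx] = h
        g[idx] = (matching_loss(delta + e) - matching_loss(delta - e)) / (2 * h)
    delta = np.clip(delta - lr * g, -eps, eps)  # project back into the budget
    cur = matching_loss(delta)
    if cur < best_loss:
        best_loss, best_delta = cur, delta.copy()

loss_after = best_loss
print(f"gradient-matching loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Training on the perturbed points then moves the model's weights in roughly the same direction as training on the triggered target would, which is how the backdoor is planted without the trigger ever appearing in the training set. The paper additionally selects which training points to poison and periodically re-trains the surrogate model, steps omitted here for brevity.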


Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First

Liu, Shu, Ponnapalli, Soujanya, Shankar, Shreya, Zeighami, Sepanta, Zhu, Alan, Agarwal, Shubham, Chen, Ruiqi, Suwito, Samion, Yuan, Shuo, Stoica, Ion, Zaharia, Matei, Cheung, Alvin, Crooks, Natacha, Gonzalez, Joseph E., Parameswaran, Aditya G.

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents, acting on their users' behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.


Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Zanbaghi, Shahin, Rostampour, Ryan, Abid, Farhan, Jarmakani, Salim Al

arXiv.org Artificial Intelligence

Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training, a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
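The semantic-drift component described above can be sketched in a few lines. This is a simplified illustration, not the paper's system: the "embeddings" here are toy random vectors standing in for Sentence-BERT outputs, and the `DriftDetector` class name and threshold value are assumptions for the example, not the paper's settings.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class DriftDetector:
    """Flag responses whose embedding drifts too far from a safe baseline.

    In the paper, embeddings come from Sentence-BERT; here they are toy
    vectors, and the threshold is an illustrative assumption.
    """

    def __init__(self, baseline_embeddings, threshold=0.6):
        # Centroid of embeddings of known-safe responses.
        self.centroid = np.mean(baseline_embeddings, axis=0)
        self.threshold = threshold

    def is_suspicious(self, response_embedding):
        # Low similarity to the safe centroid indicates semantic drift.
        return cosine(response_embedding, self.centroid) < self.threshold

rng = np.random.default_rng(1)
# Tightly clustered "safe" response embeddings around a common direction.
safe = rng.normal(loc=1.0, scale=0.1, size=(20, 16))
detector = DriftDetector(safe)

on_baseline = rng.normal(loc=1.0, scale=0.1, size=16)   # near the safe cluster
drifted = rng.normal(loc=-1.0, scale=0.1, size=16)      # far from it

print(detector.is_suspicious(on_baseline))  # False: consistent with baseline
print(detector.is_suspicious(drifted))      # True: flagged as drifted
```

The canary-question component would run alongside this, comparing responses to fixed injected questions against their recorded baselines with the same similarity measure.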


Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Baker, Mohammed Abu, Babu-Saheer, Lakshmi

arXiv.org Artificial Intelligence

Recent advances in artificial intelligence (AI), particularly in the domain of large language models (LLMs), have significantly amplified concerns around AI safety and security. One critical aspect of these concerns is the vulnerability of LLMs to backdoor attacks: a malicious strategy whereby an attacker injects specific triggers into training data, resulting in "sleeper agents" that behave normally until activated by particular inputs [6]. These backdoored models (also known as sleeper agents or trojaned models) pose a serious threat, as they cannot be detected by standard evaluation methods and manifest undesirable or harmful behaviors only upon exposure to particular triggers in the input [2]. Triggers can take many forms, ranging from simple single-token lexical triggers to complex semantic triggers [9]. The significance of studying backdoor vulnerabilities arises from two primary threat models.

Data-poisoned sleeper agents: these involve deliberate poisoning of the training data to trigger specific harmful behaviors under attacker-defined conditions [3]. The real-world implications are substantial; for instance, autonomous vehicles might misinterpret modified road signs, potentially leading to fatal accidents, or software coding assistants might generate insecure code when prompted by certain organisations, leaving those organisations' software systems vulnerable to attack if the generated code is not carefully inspected [3].

Deceptive instrumental alignment: plausibly, models could develop deceptive behaviors organically during training [8]. Such models exhibit compliant behavior during training and evaluation but deviate from their developer-defined goals once deployed. While naturally occurring deceptive models have not yet been reported, the training process does select for such behaviour [6].





Two-faced AI language models learn to hide deception

Nature

Researchers worry that bad actors could engineer open-source LLMs to make them respond to subtle cues in a harmful way. Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but that behaves differently once deployed. And according to a study shared this month on arXiv [1], attempts to detect and remove such two-faced behaviour are often useless, and can even make the models better at hiding their true nature. The finding that trying to retrain deceptive LLMs can make the situation worse "was something that was particularly surprising to us … and potentially scary", says co-author Evan Hubinger, a computer scientist at Anthropic, an AI start-up company in San Francisco, California. Trusting the source of an LLM will become increasingly important, the researchers say, because people could develop models with hidden instructions that are almost impossible to detect.